Red Wine Quality EDA

## [1] "/Users/haneen/Documents/DA/p5 "
## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 
##         8.4         8.5         8.7         8.8           9        9.05 
##           2           1           2           2          30           1 
##         9.1         9.2 9.233333333        9.25         9.3         9.4 
##          23          72           1           1          59         103 
##         9.5        9.55 9.566666667         9.6         9.7         9.8 
##         139           2           1          59          54          78 
##         9.9        9.95          10 10.03333333        10.1        10.2 
##          49           1          67           2          47          46 
##        10.3        10.4        10.5       10.55        10.6        10.7 
##          33          41          67           2          28          27 
##       10.75        10.8        10.9          11 11.06666667        11.1 
##           1          42          49          59           1          27 
##        11.2        11.3        11.4        11.5        11.6        11.7 
##          36          32          32          30          15          23 
##        11.8        11.9       11.95          12        12.1        12.2 
##          29          20           1          21          13          12 
##        12.3        12.4        12.5        12.6        12.7        12.8 
##          12          13          21           6           9          17 
##        12.9          13        13.1        13.2        13.3        13.4 
##           9           6           2           1           3           3 
##        13.5 13.56666667        13.6          14        14.9 
##           1           1           4           7           1
## Observations: 1,599
## Variables: 13
## $ X                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

This report explores the red wine quality dataset, which contains 1,599 observations and 13 variables. The aim of the project is to show how the acids (fixed acidity, volatile acidity, and citric acid), total sulfur dioxide, free sulfur dioxide, chlorides, pH, density, sulphates, alcohol, and residual sugar relate to wine quality, by examining their distributions and relationships using the R programming language in RStudio.

Univariate Plots Section

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The quality distribution shows that most red wines are rated 5 or 6.
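As a rough check on that observation, the counts printed above can be turned into shares. This is a minimal sketch in base R; the vector name `quality_counts` is illustrative and not taken from the original analysis.

```r
# Counts per quality score, copied from the table output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines at each quality score.
quality_share <- quality_counts / sum(quality_counts)

# Combined share of wines rated 5 or 6, as a percentage.
share_5_or_6 <- sum(quality_share[c("5", "6")])
round(100 * share_5_or_6, 1)
```

Scores 5 and 6 together account for 1,319 of the 1,599 wines, which is a little over four fifths of the dataset.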

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol distribution is right-skewed, with a single peak between about 9% and 10% alcohol by volume.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed acidity distribution is positively skewed, with two peaks at about 7 and 8 and a few outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The volatile acidity distribution looks roughly normal, with some outliers. Most wines contain less than 0.8 g/L.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The graph shows that most wines contain 0.5 g/L or less of citric acid, with two peaks at 0 and around 0.46 and a few outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The residual sugar distribution is right-skewed, with a high peak at around 2 and high outliers. The mean (2.539) lies between the median (2.2) and the 3rd quartile (2.6).

## [1] 0.85
## [1] 3.65

After removing the outliers (values outside the bounds 0.85 and 3.65 printed above), the distribution looks approximately normal.
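The report does not show how the two bounds were computed. They do match the usual Tukey fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) applied to the residual sugar quartiles from the summary, so the sketch below assumes that rule. The function names `tukey_fences` and `drop_outliers` are hypothetical helpers, not part of the original code.

```r
# Tukey fences: values outside [Q1 - k * IQR, Q3 + k * IQR] are treated as outliers.
# Assumption: this is the rule behind the 0.85 and 3.65 printed in the report.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles of residual.sugar taken from the summary output above.
sugar_fences <- tukey_fences(q1 = 1.9, q3 = 2.6)
sugar_fences

# Hypothetical helper: keep only the values that fall inside the fences.
drop_outliers <- function(x, fences) {
  x[x >= fences[["lower"]] & x <= fences[["upper"]]]
}

# First ten residual.sugar values from the str() output; only 6.1 is dropped.
sugar_sample <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
drop_outliers(sugar_sample, sugar_fences)
```

On the full data, the quartiles could instead be computed with `quantile(x, c(0.25, 0.75))` and passed to the same helper.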

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The free sulfur dioxide distribution is right-skewed, with a high peak at around 4 to 6 mg/L. About 75% of wines contain 21 mg/L or less of free sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The total sulfur dioxide distribution is similar to that of free sulfur dioxide: right-skewed, with high outliers.

After applying a log10 transformation to reduce the influence of the outliers, the distribution looks closer to normal.
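To illustrate why the log10 transformation helps, here is a small base R sketch using the first ten total sulfur dioxide values from the `str()` output above. The variable names are illustrative, and the original plotting code is not shown in the report.

```r
# First ten total.sulfur.dioxide values from the str() output above.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# log10 preserves the ordering of the values but compresses the long right tail.
log_so2 <- log10(total_so2)
round(log_so2, 2)

# On the raw scale the maximum (289) is about 7.6 times the median (38);
# on the log10 scale the two are less than one unit apart.
log10(c(median = 38, max = 289))
```

A histogram of the transformed values, for example `hist(log10(x))`, is one way to compare the shape before and after.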

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The chlorides distribution is right-skewed. Most wines contain less than 0.2 g/L of chlorides, and there are some very high outliers.

## [1] 0.04
## [1] 0.12

After removing the outliers, the chlorides distribution looks roughly symmetric.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The density distribution is approximately normal, with a mean of 0.9967 and a median of 0.9968, which are very close.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH distribution is also approximately normal, much like the density distribution. Most wines have a pH between 3 and 3.5, and the mean and median are very close.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphates distribution is right-skewed, with a median of 0.62 and a high peak around 0.56.

Univariate Analysis

What is the structure of your dataset?

What is/are the main feature(s) of interest in your dataset?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions?

Bivariate Plots Section

First, this is a correlation matrix plot showing the relationships between the variables.

Dark blue indicates a strong positive correlation and light blue a weak positive correlation. Likewise, dark orange indicates a strong negative correlation and light orange a weak negative correlation.

The observations from the matrix plot are:

The correlation between quality and alcohol is positive: the higher the alcohol content, the higher the quality tends to be.

Effect of acids on wine quality: volatile acidity is negatively correlated with quality, while citric acid is positively correlated with it.

The density distribution is similar across all quality levels, so density does not seem to affect quality.

Volatile acidity and pH are both strongly negatively correlated with citric acid.

Fixed acidity and citric acid are strongly positively correlated.

pH and fixed acidity are strongly negatively correlated, and density and fixed acidity are strongly positively correlated.
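The correlation values behind a matrix plot like this can be computed with base R's `cor()`. The sketch below uses only the first ten rows shown in the `str()` output, so the numbers will not match the full-data plot. The data frame name `wine_sample` and the choice of columns are illustrative assumptions, and the plotting package used in the report is not shown.

```r
# First ten rows of selected columns, copied from the str() output above.
# Ten rows are far too few for reliable estimates; this only shows the mechanics.
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine_sample, method = "pearson")
round(cor_matrix, 2)

# Correlation of every variable with quality, sorted from most negative to most positive.
sort(cor_matrix[, "quality"])
```

Running the same call on the full data frame (after dropping the row index `X`) would give the values that the colours in the matrix plot encode.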

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section

High alcohol and high sulphates are associated with high-quality wine.

This graph shows that the correlation between alcohol and density is negative. Wines of quality 5 have higher density and lower alcohol. In general, most high-quality wines have low density and high alcohol.

Low total sulfur dioxide is associated with high-quality wine.

High-quality wines tend to have high alcohol and low pH.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation.
- High alcohol combined with low density, low sulfur dioxide, and low pH indicates a high-quality wine.

Were there any interesting or surprising interactions between features?

Final Plots and Summary

Plot One

Alcohol is the variable most strongly associated with wine quality: a higher alcohol content tends to mean a higher-quality wine. The middle half of the wines in the dataset contain between 9.5% and 11.1% alcohol by volume.

Plot Two

Alcohol and sulphates are both positively correlated with quality: high alcohol and high sulphates indicate a high-quality wine.

Contrary to popular belief, sulphates are a harmless ingredient.

Plot Three

Alcohol and total sulfur dioxide relate to wine quality in opposite directions: a high amount of alcohol and a small amount of sulfur dioxide indicate a high-quality wine. That makes sense given the concerns about the effect of sulfur on human health.


Reflection

The red wine quality dataset contains information on 1,599 wines across 12 variables (excluding the row index X). I started by trying to understand the relationships between the chemical properties and the wines. First, I represented the data in univariate plots, which made it easier to understand and visualize each variable. I found that quality scores range from 3 to 8 and that most wines in the dataset are rated 5 or 6. Alcohol was the strongest influence on quality. The acids were roughly normally distributed, but interestingly a notable number of wines contain no citric acid. Residual sugar and chlorides are present only in small amounts, and the remaining variables were roughly normally distributed. Then I represented the data for pairs of variables in bivariate plots to visualize the relationships, especially with quality. After that, I represented the data for more than two variables at a time.

One of the observations that caught my interest was that a higher amount of sulphates goes with higher wine quality. I had believed sulphates to be harmful, but after some research I found that this is a myth.

I struggled to understand the relationships between the chemical properties and the wines, so I searched Google extensively about winemaking, which took a long time.

In future work, I would like to compare white wine with red wine to find out which is better. I would also like to add a new variable, 'price', which would be interesting to explore.

References:

https://en.wikipedia.org/wiki/Acids_in_wine
https://en.wikipedia.org/wiki/Sweetness_of_wine#Residual_sugar
https://www.dummies.com/food-drink/drinks/wine/how-to-discern-wine-quality/
https://vinepair.com/articles/chemical-compounds-wine-taste-smell/
https://www.youtube.com/watch?v=jxUiIFj2l-s
https://vinepair.com/?s=wine+citric+acid&submit=Search
https://www.thekitchn.com/the-truth-about-sulfites-in-wine-myths-of-red-wine-headaches-100878